The Best Trail Algorithm for Assisted 
Navigation of Web Sites 






Richard Wheeldon 

School of Computer Science and Information 
Systems 
Birkbeck University of London 

Malet St, London 
WC1E 7HX, United Kingdom 

richard@dcs.bbk.ac.uk 



ABSTRACT 

We present an algorithm called the Best Trail Algorithm, 
which helps solve the hypertext navigation problem by au- 
tomating the construction of memex-like trails through the 
corpus. The algorithm performs a probabilistic best-first 
expansion of a set of navigation trees to find relevant and 
compact trails. We describe the implementation of the al- 
gorithm, scoring methods for trails, filtering algorithms and 
a new metric called potential gain which measures the po- 
tential of a page for future navigation opportunities. 

1. INTRODUCTION 

The World Wide Web is a massive global hypertext sys- 
tem in which documents (or pages) can be found on almost 
every subject imaginable. These pages are made available 
by many authors and written in many languages. We con- 
sider a web site to represent a collection of pages with some 
common element, such as topic, author or institution. The 
process of navigation or surfing is that of following links 
according to the topology of the web site and viewing (or 
browsing) the contents of visited pages. During the navigation
process users may become "lost in hyperspace", meaning
that they become disoriented [28]. This happens when
users fail to understand the context of the pages they are
viewing, are unsure of how they reached a page, cannot see
how the page is related to key pages such as the homepage,
or are uncertain as to where they should proceed to find the
information they are looking for [23].

Vannevar Bush envisaged a hypothetical machine called a
memex [5], a cabinet-like box into which the user could store
documents and images. A sequence of such documents could 
then be annotated and linked together to form a trail. By 
continuing the process, Bush imagined that future workers 
could build a "web of trails". 

In Berners-Lee's Web, a trail or navigation path is implic- 
itly formed as the result of a navigation session in which 
the user visits a sequence of web pages. Previous research 
has shown how the trails which users follow can be extracted
from log data. Often the starting point for one of
these trails is a page resulting from a search request [29], yet
existing site search engines will neither consider the possi- 
bilities for future navigation when returning their result nor 
present details of the paths users might follow.

It is our hypothesis that constructing trails or paths in
a query-dependent manner will provide contextual information
that will reduce the effects of the navigation problem
and increase user satisfaction during search tasks. Our
contribution is to describe a probabilistic best-first algorithm
for automating the discovery of memex-like trails from a set 
of starting points. We describe metrics for evaluating trails, 
and introduce a new metric for determining more effective 
starting points by evaluating the potential gain of future 
navigation from a given page. Previous hypertext systems 
have featured the ability to manipulate trails manually I I 21 
1361 or allowed the construction of trails using pure IR 
metrics |Hll6|. However, none of these systems has allowed 
the automatic construction of trails by the computer in any 
way that takes account of hyperlinks. 

The rest of this paper is organized as follows. In section 2
we describe our system for computing trails: selecting starting
points using the potential gain metric, expanding the
trails using the Best Trail algorithm and filtering redundant
information from them with heuristic methods. In section 3
we describe our preliminary efforts to evaluate the utility
of the navigation engine which uses these trails to assist
users [24]. In section 4 we describe our implementation of
the algorithm. In section 5 we describe experiments into
the behaviour and performance of this implementation. We
discuss related work in section 6 and give our concluding
remarks and directions for future research in section 7.

2. COMPUTING TRAILS 

In this section we outline our methodology for computing 
trails. Trails are computed by selecting relevant starting 
points, expanding a navigation tree from each node using the 
Best Trail algorithm before filtering and sorting the resulting 
set of trails. 

We view a web site as a hypertext system H having two
components: a directed graph G = (N, E), having finite sets
of nodes and edges N and E, respectively, and a scoring
function $\mu$, which is a function from N to the set of
non-negative real numbers. The directed graph G defines the
web site topology and is referred to as the web graph; the
nodes in N represent the web pages and the edges in E
represent hyperlinks (or simply links) between anchor and
destination nodes. Figure 1 shows an example web graph,
taken from the Graphviz web site
(www.research.att.com/sw/tools/graphviz), which we will use as
a running example. The terms node, web page and URL
will be used interchangeably. We interpret the score, $\mu(m)$,
of a web page $m \in N$ as a measure of how relevant m is
with respect to a given query, where the query is viewed as
the goal of the navigation session. The Best Trail algorithm
computes trails scored by a function of these page scores.

2.1 Selecting Starting Points 

Whilst simply expanding from relevant points is effective, 
we can do better by considering future navigation oppor- 
tunities in our starting point selection. We have created a 
metric for finding good starting points which we refer to as 
the potential gain of a URL, that is, its potential for future
navigation opportunities. Potential gain is defined as the sum,
over all depths d, of the product of the fraction of trails to
depth d and the discounting function f(d). It is easily computed
by an iterative algorithm or by a series of matrix operations
[39, 25]. For larger graphs, we can utilize similar techniques to
those proposed for the PageRank citation metric [301117111??] . 
For our experiments, we compute potential gain using the
reciprocal function, $f(x) = x^{-1}$.

When restricted to a maximum depth of traversal, $d_{max}$,
the naive algorithm takes $O(d_{max} \cdot |E|)$ time
and $O(|N|)$ space to compute potential gain
values for all nodes in G, given that G is sparse. In practice,
after a brief settling period, convergence to a set of potential 
gain values occurs in a short space of time. Bucketed values 
for potential gain follow a power-law distribution, as is found 
for PageRank and many other web-related phenomena |J. 
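
The iterative computation can be illustrated with a short Python sketch. It is not the implementation used in the experiments: the function name `potential_gain`, the edge-list representation and the reading of "fraction of trails to that depth" as the share of all depth-d trails that start at a given node are assumptions made for illustration. Each depth costs one pass over the outlinks, in line with the $O(d_{max} \cdot |E|)$ bound quoted above.

```python
from collections import defaultdict


def potential_gain(edges, nodes, d_max, f=lambda d: 1.0 / d):
    """Estimate potential gain for every node by iterating over depths.

    edges: iterable of (source, destination) pairs; every endpoint is
           assumed to appear in `nodes`.
    f:     discounting function; the reciprocal f(d) = 1/d by default.
    """
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)

    gain = {u: 0.0 for u in nodes}
    trails = {u: 1 for u in nodes}      # depth 0: the trivial trail at u
    for d in range(1, d_max + 1):
        # A trail of depth d from u is a link (u, v) followed by a trail
        # of depth d - 1 from v.
        trails = {u: sum(trails[v] for v in out[u]) for u in nodes}
        total = sum(trails.values())
        if total == 0:                  # no trail reaches this depth
            break
        for u in nodes:
            # Assumed reading: share of all depth-d trails starting at u.
            gain[u] += f(d) * trails[u] / total
    return gain
```

Only one trail count per node is kept between depths, so the working space is linear in the number of nodes.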

2.2 The Best Trail Algorithm 

The pseudo-code of the Best Trail algorithm is shown in
figure 2. It takes as input a set of starting URLs, S, and a
parameter, M > 1, which specifies the number of repetitions
of the algorithm for each input URL. When the algorithm
terminates it outputs a set of trails, B. There are M trails
in B for each URL in S. Each trail is the highest ranking
trail contained within the navigation tree expanded from a 
single starting node. A navigation tree is a finite subtree of 
the possibly infinite tree generated by traversing through G, 
the root of which is a member of the set of starting points. 
Manipulating sets of navigation trees has a filtering effect on 
the set of starting points, reducing the rank of nodes which 
are isolated from other relevant documents and from which 
navigation is problematic. Returning trails from separate 
trees also has the effect of removing highly similar trails 
before further filtering is required. 

Starting from each node in S, the algorithm follows links 
from anchor to destination according to the topology of the 
web site. At each stage of the traversal, one of the tips (the 
leaf nodes of the navigation tree) is chosen for expansion. 
The destination node of each outlink whose source is rep- 
resented by the chosen tip is assigned a new tip which is 
added to the navigation tree, along with a computed trail 
score. Previously visited nodes in the web graph will result 
in distinct nodes in the navigation tree, with identical page 
scores but different trail scores. Figure 3 shows an example
navigation tree based on the web topology shown in figure 1.

The algorithm has a main outer for loop which computes 
the best trail for each URL. The second loop recomputes the 
best trail M times. The two innermost loops comprise the 
exploration and convergence stages of the algorithm, both of 
which expand the navigation tree - from which the best trail 
is selected by the best() function. The number of iterations



Algorithm 1 (Best_Trail(S, M)).

begin
  B ← ∅;
  foreach u ∈ S
    for i = 1 to M do
      D_i ← {u};
      for j = 1 to I_explore do
        t ← select(D_i);
        D_i ← expand(D_i, t);
      end for
      for j = 1 to I_converge do
        t ← select(D_i, df, j);
        D_i ← expand(D_i, t);
      end for
      B ← B ∪ {best(D_i)}
    end for
  end foreach
  return B
end.



Figure 2: The Best Trail Algorithm. The algorithm 
takes two arguments. M is the number of repetitions 
and S is a set of starting URLs. 



in the exploration phase is set by $I_{explore}$, whilst the number
of iterations in the convergence phase is set by $I_{converge}$.
During the exploration phase, the select() function selects a
tip to expand where the probability of a tip t being selected
is given by

$$P(D_i, t) = \frac{\rho(t)}{\sum_{k=1}^{n} \rho(t_k)}$$



where $\rho$ is a scoring function for the trail and
$t_1, \ldots, t_n$ are the candidate tips of $D_i$, making the
probability of any node being selected directly proportional
to its score. During the convergence phase, the probability
of a node t being selected is dependent only on its relative
rank, r(t), in the ordered set of candidate tips, and is given
by

$$P(D_i, t, df, j) = \frac{df^{\,r(t)\,j}}{\sum_{k=1}^{n} df^{\,r(t_k)\,j}}$$



where j is the number of completed convergence iterations
and $0 < df < 1$ is a discrimination factor. The discrimination
factor allows us to discriminate between "good" trails
and "bad" trails by reducing the influence of trails with low
scores. Thus during the convergence stage "better" trails
get assigned exponentially higher probability. Setting df
equal to 1 would imply a uniform random selection, whilst
as df tends towards 0, the behaviour of the algorithm tends
towards that of a best-first approach. The degenerate case
of the Best Trail algorithm where $df = 0$, $I_{explore} = 0$ and
$I_{converge} > 0$ is equivalent to a simple best-first algorithm. The
rank of a tip, t, (or of the trail leading to it), denoted by
r(t), is determined by the tip's position within the ordered
set of candidate tips. The position of t is determined by
comparing trails based upon

1. The number of query terms matched by the trail ending
at t.

2. The maximum number of query terms matched by any
single page in the trail.

3. The trail score, $\rho(t)$.

Figure 1: An example Web topology, extracted from a crawl of the web site for the Graphviz project. The
numbers denote unique identifiers assigned to all URLs. The gaps in the sequence of IDs are due to URLs
referenced by the website to pages elsewhere on the web. These URLs are referenced, but the textual content
of the pages is not indexed. The numbers in parentheses denote relevance scores for the query "dotty".

It has been argued that the number of keywords in a query 
that are matched by a document should take precedence over 
other scoring mechanisms, and that the terms for a query 
may be spread across several pages |5J 1261 118| . Ranking the 
trails first upon the number of keywords that are matched
incorporates both of these ideas and improves relevance.
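
To make the two selection rules and the overall control flow concrete, the following Python sketch may help. It is a simplified illustration under stated assumptions, not the implementation described later in this paper: tips are stored as tuples holding the trail that leads to them, an expanded tip is dropped from the candidate set, tips are ranked by trail score alone (the query-term criteria above are omitted), rank 0 denotes the best tip, and the names `select_explore`, `select_converge`, `best_trail`, `outlinks` and `rho` are hypothetical.

```python
import random


def select_explore(tips, rho, rng):
    """Exploration: pick a tip with probability proportional to its score."""
    weights = [rho(t) for t in tips]
    if sum(weights) <= 0:               # all scores zero: choose uniformly
        return rng.choice(tips)
    return rng.choices(tips, weights=weights, k=1)[0]


def select_converge(tips, rho, df, j, rng):
    """Convergence: pick by rank only; rank r has weight df ** (r * j)."""
    ranked = sorted(tips, key=rho, reverse=True)    # rank 0 = best (assumed)
    weights = [df ** (r * j) for r in range(len(ranked))]
    return rng.choices(ranked, weights=weights, k=1)[0]


def best_trail(starts, outlinks, rho, M, i_explore, i_converge, df, seed=0):
    """Return M best trails per starting URL (sketch of Algorithm 1)."""
    rng = random.Random(seed)
    B = []
    for u in starts:
        for _ in range(M):
            tips = [(u,)]       # candidate tips, each stored as its trail
            tree = [(u,)]       # every trail in the navigation tree so far

            def expand(t):
                # Replace tip t by one new tip per outlink of its last page.
                tips.remove(t)
                children = [t + (v,) for v in outlinks.get(t[-1], ())]
                tips.extend(children)
                tree.extend(children)

            for _ in range(i_explore):
                if tips:
                    expand(select_explore(tips, rho, rng))
            for j in range(1, i_converge + 1):
                if tips:
                    expand(select_converge(tips, rho, df, j, rng))
            B.append(max(tree, key=rho))    # best(D): highest scoring trail
    return B
```

With df = 0 and no exploration iterations the convergence weights collapse onto the top-ranked tip, so the sketch behaves as a plain best-first search, matching the degenerate case described above.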

2.3 Scoring Trails 

We compute the relevance or score of a trail, $T = U_1, U_2, \ldots, U_n$,
as a function, $\rho$, of the scores of the individual web pages of
the trail. We need a function which encourages non-trivial 
trails whilst discouraging redundant nodes. The following 
functions perform well in this regard: 

1. The sum of the scores of the distinct URLs in the
trail divided by the number of pages in the trail
plus some constant (e.g. 1). We refer to this scoring
function as sum distinct. This function penalises the 
trail when a URL is visited more than once. It also 
penalises trivial singleton trails and encourages trails 
where every node makes a significant contribution to 
the score. Removing the constant factor leads the ob- 
jective function to return a maximal score in the case 
of a singleton node where that node is the highest scor- 
ing page in the corpus. Scoring functions such as the 
average score or maximum score of a node on a trail 
also suffer from this problem. 



2. The discounted sum of the scores of the URLs in the
trail, where the discounted score of $U_i$, the URL in the
ith position in the trail, is the score of $U_i$ with respect
to the query multiplied by $\gamma^{i-1}$, where
$0 < \gamma < 1$ is the discount factor.

3. The weighted sum of discounted scores, where the
additional weighting is achieved by discounting each URL
according to its previous number of occurrences within
the trail. The weighted score of T is given by

$$\rho(T) = \mathrm{weighted}(T) = \sum_{i=1}^{n} \mu(U_i)\, \gamma^{i-1}\, \delta^{c(i)}$$

where n is the number of pages in T,
$c(i) = |\{U_j \mid j < i \wedge U_j = U_i\}|$ and $\delta$ is a second
discounting function, which reduces the importance of
nodes with equal content. We note that although i = j
implies $U_i = U_j$, $U_i = U_j$ does not imply i = j. Two
distinct nodes may be considered equal if they have
equal content, determined in advance using checksums
of page contents and comparing likely candidates. This
definition of node equality can easily be extended to
refer to near-duplicate documents [5, 35].
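
The three scoring functions can be written down directly. The Python sketch below is illustrative only: the function names, the dictionary `mu` of page scores and the values chosen for $\gamma$ and $\delta$ in the usage note are assumptions, and node equality is simplified to equality of URL identifiers rather than of page content.

```python
def sum_distinct(trail, mu, constant=1.0):
    """Sum of the scores of the distinct URLs over (trail length + constant)."""
    return sum(mu[u] for u in set(trail)) / (len(trail) + constant)


def discounted_sum(trail, mu, gamma):
    """Score of the i-th page (1-based) is discounted by gamma ** (i - 1)."""
    return sum(mu[u] * gamma ** i for i, u in enumerate(trail))


def weighted_sum(trail, mu, gamma, delta):
    """Discounted sum with a further factor delta ** c(i) for repeat visits."""
    seen = {}       # URL -> number of earlier occurrences in the trail
    total = 0.0
    for i, u in enumerate(trail):
        c = seen.get(u, 0)              # c(i): earlier occurrences of u
        total += mu[u] * gamma ** i * delta ** c
        seen[u] = c + 1
    return total
```

For a singleton trail whose page scores 1.8076, `sum_distinct` with a constant of 1 gives 0.9038 and `weighted_sum` gives 1.8076, whatever values of `gamma` and `delta` are used. Revisiting a page adds nothing to `sum_distinct` but lengthens the trail, so the repeat is penalised in both functions.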

Figure 4 shows how the trails in the navigation tree
(figure 3) would be scored after two
expansions (of tips 1 and 3). The examples shown in this
paper are constructed by computing two trails from each 
starting point - one scored using the sum distinct metric 
and one using the weighted sum. 




Figure 3: An example navigation tree based upon the site structure shown in figure 1. Each node is annotated
with a unique tip id, a URLid, with the corresponding URL also shown. Red ellipses denote candidate tips for 
expansion. The tip numbers are assigned in sequence during the iteration of the algorithm. In this example, 
the tips numbered 1, 3, 9, 5 and 24 were expanded. 



Tip   Weighted Sum   Sum Unique
1     1.8076         0.9038
2     3.2593         1.2477
3     6.5056         2.6905
4     1.8076         0.6025
5     3.6534         1.4230
6     1.8076         0.6025
7     1.8076         0.6025
8     1.8076         0.6025
9     7.5940         2.5018
10    6.5056         2.0179
11    6.5056         2.0179
12    6.9194         2.2018



Figure 4: Example trail scores using Weighted
Sum and Sum Unique. The
high score associated with the first trail is useful
in forcing the most relevant pages to the
forefront of the display. Merging trails with common
roots gives a good ordering to the display, as can be
seen in figure 5.



2.4 Sorting and Filtering 

The returned set of trails is unsorted and may contain 
redundant information. To sort the trails would appear to be
trivial: we simply apply the same rules of sorting by number
of keyword matches and then by the trail score. However,
we have more than one mechanism for scoring trails, and 
we can compute trails in different navigation trees using 
different functions. We can sort the resulting trails using 
a set of scoring functions, F, by specifying that a trail $T_1$
should be ranked higher than a trail $T_2$ if:

$$\sum_{f \in F} \frac{f(T_1)}{f(T_1) + f(T_2)} > \sum_{f \in F} \frac{f(T_2)}{f(T_1) + f(T_2)}$$
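
A pairwise comparison of this kind, in which each scoring function contributes the share of the combined score that it assigns to each trail, might be sketched as follows. The function name `ranks_higher` is hypothetical, and skipping a scoring function that gives both trails a score of zero is an assumption made here to avoid division by zero.

```python
def ranks_higher(t1, t2, scoring_functions):
    """True if t1 should be ranked above t2 under the set of scoring functions."""

    def share(a, b):
        # Sum over all f of f(a) / (f(a) + f(b)).
        total = 0.0
        for f in scoring_functions:
            fa, fb = f(a), f(b)
            if fa + fb > 0:             # assumed: ignore f when both are zero
                total += fa / (fa + fb)
        return total

    return share(t1, t2) > share(t2, t1)
```

Because each term is normalised by the combined score, scoring functions with very different ranges contribute equally to the decision.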

We can improve results by removing redundant trails and
redundant sections within trails. To achieve this, we need
to define precisely what is meant by a redundant trail. We
say that a trail $T_1$ subsumes a trail $T_2$ if and only if all the
pages in $T_2$ are contained in $T_1$. A trail $t_1$ is removed from
a result set r if and only if there exists a trail $t_2 \in r$ such
that $t_2$ subsumes $t_1$ and $\rho(t_2) > \rho(t_1)$. Within a trail T, we
consider a page $t_i$ to be redundant if and only if the page
can be removed whilst still leaving a valid trail through the
web site topology (i.e. if $t_i$ is the last node of the trail or
$(t_{i-1}, t_{i+1}) \in E$) and the information contained on page $t_i$ is
either not relevant or contained in a previous page (i.e. if
$\mu(t_i) = 0$ or $\exists j : t_j = t_i \wedge j < i$). These definitions were arrived
at as the result of several experiments and typically remove
trivial reorderings and irrelevant content.
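
Both filters are simple to express in code. The Python sketch below follows the definitions above under a few assumptions: trails are tuples of URL identifiers, `edges` is a set of (source, destination) pairs, `mu` maps pages to relevance scores, the first page of a trail is never removed, and all function names are hypothetical.

```python
def subsumes(t1, t2):
    """t1 subsumes t2 if every page of t2 also occurs in t1."""
    return set(t2) <= set(t1)


def filter_trails(trails, rho):
    """Drop each trail that is subsumed by a higher scoring trail."""
    return [
        t1 for t1 in trails
        if not any(
            t2 is not t1 and subsumes(t2, t1) and rho(t2) > rho(t1)
            for t2 in trails
        )
    ]


def remove_redundant_pages(trail, edges, mu):
    """Remove pages that add nothing and can be skipped over in the graph."""
    pages = list(trail)
    i = 1                               # assumed: the root page is always kept
    while i < len(pages):
        is_last = i == len(pages) - 1
        skippable = is_last or (pages[i - 1], pages[i + 1]) in edges
        adds_nothing = mu[pages[i]] == 0 or pages[i] in pages[:i]
        if skippable and adds_nothing:
            del pages[i]
            i = max(1, i - 1)           # neighbours changed: re-check previous
        else:
            i += 1
    return pages
```

For example, on a graph with links a→b, b→c, a→c and c→d where only a and c are relevant, the trail a, b, c, d is reduced to a, c.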

Finally, the trails with common roots are merged into a
tree and presented in the NavSearch UI 12 II, shown in
figure 5. Two other interfaces have been
developed for displaying these trails: a flat TrailSearch
interface, similar to that used by traditional search engines for
displaying linear results, and a GraphSearch interface which
displays the results in the form of a graph [41].

3. EVALUATION 
3.1 A Case Study 

A case study was performed into the use of the navigation
engine on the Birkbeck School of Computer Science and
Information Systems (SCSIS) web site. Queries were taken from a
recent log file and analysed. The chief results of the analysis
are presented along with examples.

Figure 5: Screenshot showing the presentation of
results for the query "dotty" on the topology shown
in figure 1.

The trails provide relevant information. For example,
results for the query "andrew" find the home pages of Andrew
Bielinski, Andrew Watkins and Andrew Mair. For the query
"application form", the first trail identifies the application
form for the MSc E-Commerce course and the second
identifies the application form required for the undergraduate
program (figure 6). The first two trails for the query "xml",
shown in figure 7, give brief tours of an XML tutorial,
always linking to external resources containing a great deal
of relevant information. The third trail provides an
explanation of XML namespaces connected to a hub with lots of
XML references. The use of potential gain in the starting
point selection encourages such hubs to be chosen. The
fourth trail details the use and history of XML as a markup
language and its relationship to SGML. Subsequent trails
describe the Information Technology (IT) applications
module on XML.

However, relevant content can be found with conventional, 
linear, search engines. More important is that the trails pro- 
vide context to show associations and to help disambiguate 
the meaning of keywords and page descriptions. For exam- 
ple, the structure of the trails for the query "andrew" shows 
Andrew Bielinski to be a research student under the supervi- 
sion of Mark Levene and that Andrew Mair is (although not 
a member of the department) associated with the BSc Infor- 
mation Systems and Management course. Similarly, for the 
query "neural network" , the first trail shows the course "Ar- 
tificial Intelligence & Neural Networks" linked to the home 
page of Chris Christodoulou who teaches the course. Chris 
Christodoulou is the SCSIS expert on neural networks. The
second trail leads from his home page to the only one of his
papers, "A Spiking Neuron Model: Applications and
Learning", linked to from his home page. The user posing the
query "exam papers" was almost certainly a student looking
for past papers for revision. Figure 8 shows that the first
two trails provide exactly that. The second trail shows that
the papers relate to the module "Developing Internet
Applications". There are surprisingly few past papers available
on the SCSIS site and the remaining trails for this query



Figure 6: Trails found for the query "application
form" on the SCSIS site.



give details relating to arrangements for sitting exams for that
summer. The context provided by the trails makes it easier 
to distinguish between the two types of result. 

Unfortunately, the contextual information can be lost when
inadequate short titles are presented to describe the pages.
For example, in figure 7 it is impossible to tell any
differences between the pages which share the title "IT
APPLICATIONS". Similarly, for the query "accomodation" (sic),
there are many different pages shown in the trails, all of
which relate to the Web Dynamics workshop and contain
the search term, but there is no means to discriminate
between them. The authors of the pages made no changes in
the h1 or title tags by which to identify the differences.
The most appropriate title is contained in a later h3 tag.

The query "accomodation" also highlights another major
problem: spelling errors are not corrected. Minor user errors
or parsing errors in the software introduce significant errors
in the presented trails. Similarly, examples such as "birkbol
programmes", "infirmation systems" and "Information
Enginerring" highlight the failure of users to construct
meaningful, accurate queries [37].

Overall, the filtering operations appear to work well at
reducing redundant information without destroying
contextual information. However, redundant information appears
commonly when near-duplicate documents cause separate,
highly similar, trails to be created. For example, in figure 6
pages entitled "IT APPLICATIONS" are distinct but differ
only by the inclusion of an irrelevant "assessment" section.
This small difference causes the creation of 2 separate trails.
This can be fixed with the application of near-duplicate
detection algorithms [5, 35].

Figure 7: Trails found for the query "xml" on the
SCSIS site.

The link structure can be broken when the crawler-based
engine fails to identify all the possible links. This can
happen for several reasons: malformed URLs, conservative robot
exclusion policies [21], JavaScript links and CGI forms. For
example, the link between rstudentperson.asp?name=bielinski
and Andrew Bielinski's home page is missing, as are the
links from all pages in the SCSIS site to the home page, 
HUj'WW KiouTs^sl (research and seminars pages. Similar be- 
haviour found with the output of Content Management Sys- 
tems (CMSs) such as Vignette or Documentum. The long- 
term solution to this problem is to tie the trail engine into 
a better IR system and offer interfaces to the main CMSs. 
For the current research prototype this is not feasible, but 
would be essential if the navigation engine was to be devel- 
oped fully. 

The conclusion that can be drawn from this analysis is 
that the trails found by the navigation engine are useful, 
but the overall utility of the system is being limited by
problems with related modules, namely IR, near-duplicate
detection and short title generation. Given all these problems,
the overall performance of the system is highly promising. 
However, to truly test the system's effectiveness requires an 
independent test with real users. 

3.2 A User Study 

Figure 8: Trails found for the query "exam papers"
on the SCSIS site.

In order to assess the usefulness of the NavSearch interface
and prove the hypothesis that "a trail-based search and
navigation engine improves users' navigation efficiency",
Mat-Hassan and Levene conducted a usability study. The results
they obtained from the study revealed that users of the
navigation engine performed better in solving the question set
posed than users of a conventional search engine [27].

Users were given two sets of information seeking tasks to 
complete based upon the pages in UCL's official Web site. 
Three different search tools were evaluated, one of which was 
the navigation engine with the NavSearch interface. The 
others were Compass (UCL's official site search engine) and 
Google's university search of UCL 2 . Subjects were asked to 
answer two sets of questions, designed to be at the same 
level of difficulty, using either NavSearch and Google or 
NavSearch and Compass. The question sets were formu- 
lated so that all the questions fell within one of five types : 
fact finding, judgement questions, comparison of fact, com- 
parison of judgement and general navigational questions. 

Most of the subjects assigned to use Google were more op- 
timistic about the initial likelihood of completing the task, 
whilst those subjects assigned to use NavSearch were ini- 
tially more reserved and pessimistic. None of the subjects 
had had any previous experience with NavSearch and famil- 
iarity was identified as the main factor in favour of Google's 
linear interface model. Users were reported to have "found 
the interface quite intimidating" considering it a "radical 
shift" from the conventional layout and format of results. 

The interfaces were assessed according to users' comple- 
tion time, the number of clicks employed, the number of cor- 
rect answers found by the subjects and the confidence and 
satisfaction levels expressed by the subjects. When asked to 
compare NavSearch with Google or Compass, subjects ex- 
pressed a much higher degree of confidence in their ability to 
complete future tasks, a higher degree of satisfaction with 
NavSearch with regard to the completion of tasks and a
higher degree of satisfaction with regard to navigation
and the display of results. Users stated that "showing
link relationship helps" and that the system provided "use- 
ful trails" which gave "an indication of the pages already 
looked at and the pages that might be useful to look at". 
96% of the study's subjects chose NavSearch over Google
and Compass as their preferred search engine. Mat-Hassan
and Levene concluded that "the proposed user interface does
indeed provide effective information retrieval assistance".

4. IMPLEMENTATION 

In this section we give a brief outline of the architecture re- 
quired to support trail finding and details of the algorithm's 
implementation. 

Each node, page or URL is assigned a unique ID. IDs are
32-bit signed integers assigned in sequence (from 1) to each
URL such that any two identical URLs will have an
identical ID. The mapping between URLs and IDs is performed
using Berkeley DB files [38]. Each page is associated with a
relevance score, determined using tf.idf measures, although
it may be computed using any information retrieval
metric [34]. Given a set of relevances and a graph in this
form, we compute the best trails by running the traversal
stages in a separate thread for each starting point.

There are many ways to access relevance data in constant 
time - either through array lookups or hashtables, depending 
on the size of the webcase. The graph is stored using the 
URL ids as references. Many strategies have been presented 
for returning sets of inlinks and outlinks from large graphs 
with appropriate time-space trade-offs 151 1321 15). 

At each step of the expand and converge process we must
select a tip for expansion based upon the probability
distributions described in section 2. These distributions have
been carefully selected to allow the use of binary trees for
storing this trail score information. We can implement this
efficiently by using a table describing the tip selection tree
at each stage, reducing the object creation overhead.
Associated with each tip is the sum of all relevances for all
descendants, denoted as the subscore, s, and the total
number of descendants, referred to as the subcount, c.
Figure 9 shows the table storing the tips of the navigation
tree shown in figure 3.

Tip   Score    Left   Right   SubScore   SubCount
...
9     7.5940          12      ...        ...
10    6.5056          11      13.0112    2
11    6.5056                  6.5056     1
12    6.9194                  6.9194     1

Figure 9: Table showing candidate tips for expan- 
sion. SS is the sum of the scores for the current 
node and all descendants and SC is the number of 
active nodes reachable from that node. It should be 
noted that the nodes in this tree represent tips and 
should not be confused with either the nodes of the 
graph or the navigation tree produced by the Best 
Trail. 

When selecting a tip to expand, a random number between
0 and x is selected, where either x is the subscore or

$$x = \sum_{k=1}^{n} df^{\,r(t_k)\,j}$$

which can be computed in constant time by applying the
known result for the sum of a geometric series³. At each
step in the subsequent traversal, this process is repeated for
the nodes to the left and right of the current node, adjusting
x and y appropriately. Thus, the interval in which the
selected value lies can be chosen and a direction selected.
Once completed a single tip will remain, which is then
expanded. For example, in an expansion iteration, the process
would start with the selection of a random number between
0 and 49.9809. If the number 49 was chosen, the process
would proceed to the right. If the number 35 was chosen,
the process would proceed to the left.
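
The descent through the tip selection tree can be illustrated with a small Python sketch for the score-proportional case. It is a sketch under assumptions rather than the table-based implementation described here: the class `TipNode`, the functions `insert`, `select` and `geometric_total`, the convention that higher scoring tips are placed to the left, and the layout of each node's interval (left subtree first, then the node itself, then the right subtree) are all hypothetical choices. Ranks are taken to start at 0 in the closed form for the convergence total.

```python
class TipNode:
    """One row of the tip table: a tip with links to its left and right tips."""

    def __init__(self, tip, score):
        self.tip, self.score = tip, score
        self.left = self.right = None
        self.subscore = score           # own score plus all descendants
        self.subcount = 1               # this node plus all descendants


def insert(root, node):
    """Insert by score, updating subscore and subcount on the way down."""
    if root is None:
        return node
    cur = root
    while True:
        cur.subscore += node.score
        cur.subcount += 1
        side = "left" if node.score > cur.score else "right"
        child = getattr(cur, side)
        if child is None:
            setattr(cur, side, node)
            return root
        cur = child


def select(root, x):
    """Return the node whose interval contains x, for 0 <= x <= root.subscore."""
    cur = root
    while True:
        left_sum = cur.left.subscore if cur.left else 0.0
        if x < left_sum:
            cur = cur.left
        elif x < left_sum + cur.score or cur.right is None:
            return cur
        else:
            x -= left_sum + cur.score   # shift x into the right subtree
            cur = cur.right


def geometric_total(df, j, n):
    """Sum of df ** (r * j) for ranks r = 0 .. n - 1, in constant time."""
    q = df ** j
    return float(n) if q == 1 else (1 - q ** n) / (1 - q)
```

A tip would then be drawn with `select(root, random.uniform(0, root.subscore))`. Each step discards one subtree, so the cost is proportional to the height of the tree.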

4.1 Complexity 

It has been shown how the step select($D_i$, df, j) can be
implemented to run in time $O(\log n)$ where n is the number
of candidate tips. The function best() has the same time
complexity, but is slightly simpler in that each iteration is to
the left of the current node. Hence, the worst case complexity
of algorithm 1 using this implementation can be given
as $O(KMI^2\beta^2)$ where $I = I_{explore} + I_{converge}$ and $\beta$ is the
maximal outdegree of any node in N. This can be broken
down as follows:

$I\beta$ as the worst-case insertion time for a tip. This factor
emanates from the fact that the tree of tips may become
a linked list if all new tips are added to the same
part of the tree. This might occur in the simple case
of nodes having identical scores, so these scores are biased
using tiny random numbers to adjust the rank.
The magnitude of these adjustments means that they
affect only the speed of the operation, not the end results.

$\beta$ representing the number of potential tips which may be
added to the candidate set at each iteration. This
number would always be added on a fully connected
graph, but graphs based upon Web data are very sparse
and this will never occur in practice.

$KMI$ as the maximum number of iterations the Best Trail
may take to find the given trails.
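
As a quick check, multiplying the three factors listed above gives the stated worst-case bound. The grouping below only restates that list; the labels under the braces are descriptive shorthand added for this example.

```latex
\[
\underbrace{I\beta}_{\text{insertion into the tip tree}}
\times
\underbrace{\beta}_{\text{new tips per iteration}}
\times
\underbrace{KMI}_{\text{iterations}}
= KMI^{2}\beta^{2}
\]
```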

In practice the tree of tips is unlikely to be skewed to
such a degree. Nor is the graph likely to be fully connected.
However, if the average-case complexity is computed by
substituting the average outdegree, the results are still
inaccurate. Using the weighted average outdegree better models
the expansion of the navigation tree during the exploration
and convergence phases. The weighted outdegree, $W$, of a
node, $n$, is defined as the product of the number of out-links
$(n, x)$ from that node and the proportion of links in
the graph which point to that node, $|\{(x, n) \in E\}| / |E|$.
It is assumed that all links are as likely to be followed as any
others, given a sufficient number of queries. It should be noted
that, when expanding a navigation tree, the number of potential
trails to a depth of $d$ is roughly equal to $\sum_{i=1}^{d} w^i$,
where $w$ denotes the weighted average outdegree of a graph.
Given that $\beta$ is the weighted average outdegree, the
average-case complexity can be given as
$O(KMI\beta \log(I\beta))$. Using binary trees, the average-case
complexity of the expand operation is $O(\beta \log \beta I)$,
since there are, on average, $\beta$ elements to be added to the
list of candidate tips and the complexity of the operation to
insert each of these new candidates is equal to that of the
select function, $O(\log \beta I)$.
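
The weighted outdegree defined above can be computed directly from a list of links. The sketch below is illustrative only: the function names are hypothetical, the graph-level weighted average is assumed to be the sum of the per-node values (the in-link proportions sum to one), and the trail count is assumed to grow as the sum of w^i for i from 1 to d.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple

Link = Tuple[Hashable, Hashable]  # (source page, destination page)


def weighted_outdegrees(links: Iterable[Link]) -> Dict[Hashable, float]:
    """W(n) = (out-links of n) * (links pointing to n) / (total links)."""
    link_list: List[Link] = list(links)
    total = len(link_list)
    if total == 0:
        return {}
    out_deg = Counter(src for src, _ in link_list)
    in_deg = Counter(dst for _, dst in link_list)
    nodes = set(out_deg) | set(in_deg)
    return {n: out_deg[n] * in_deg[n] / total for n in nodes}


def weighted_average_outdegree(links: Iterable[Link]) -> float:
    """Assumed graph-level value: the sum of W(n) over all nodes."""
    return sum(weighted_outdegrees(links).values())


def approx_trail_count(w: float, depth: int) -> float:
    """Assumed estimate of the number of trails up to the given depth."""
    return sum(w ** i for i in range(1, depth + 1))


if __name__ == "__main__":
    site = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
    print(weighted_outdegrees(site))
    print(weighted_average_outdegree(site))
```

Weighting by in-links reflects the assumption in the text that every link is equally likely to be followed, so a page is reached in proportion to the number of links pointing at it.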

5. EXPERIMENTAL RESULTS 

We have conducted numerous experiments to test the behaviour
of the algorithm and explore the effect of the various
parameters which control it. These were mostly performed
on crawls of the Birkbeck website, the School of Computer
Science and Information Systems website and the JDK 1.4
javadocs, primarily due to the abundance of query information
available to us.

Behaviour of the algorithm is controlled by the parameters
$df$, $I_{explore}$, $I_{converge}$, $M$ and the set of starting
points $\{U_0, U_1, \ldots, U_k\}$. As we would expect, increasing
the value of either of the parameters $I_{explore}$ or
$I_{converge}$ produces higher scoring trails on average
(figure 10). Unsurprisingly, increasing $I_{converge}$ finds the
local limit of the trail score faster than increasing
$I_{explore}$, as shown by the sharp rise at the very start of
the curve. Perhaps more surprising is the behaviour when altering
the ratio between $I_{explore}$ and $I_{converge}$. Increasing
$I_{explore}$ whilst decreasing $I_{converge}$ increases the
scores of the resulting trails if we measure the relevance using
sum distinct but decreases the trail score when calculated using
the weighted sum (figure 11). The balance between the values
$I_{explore}$ and $I_{converge}$ can be tuned to reflect the
importance of the two metrics. Increasing the value of $M$ is
less effective, as repeated exploration from the same node causes
many of the expansions to be duplicates of those performed in
other trees. We can use the multi-threaded environment better by
expanding from a greater number of starting points, as shown in
figure 12.




[Plot for Figure 12; x-axis: Starting Points]



Figure 12: Increasing the number of starting points 
increases the score for trails, by allowing a greater 
number of opportunities for discovery. Trail sets are 
truncated to the same size. 

In order to evaluate the effectiveness of the Potential Gain
metric in improving trail scores, we analysed the scores of
trails found by traversing the graph from starting points
selected by combining the tf.idf IR measure, $\mu(p)$, of a page
$p$ with the page's potential gain, $Pg(p)$, in several different
ways. Comparisons were also made to test the effectiveness
of a simple outdegree count, $Out(p)$, and of Kleinberg's hub
metric [20]. The results showed that, relative to the baseline
of selecting according to $\mu(p)$, a significant improvement is
achieved by taking the highest scoring pages when scored
using $\mu(p)Pg(p)$ or $\mu(p) \log Pg(p)$. Surprisingly, the
simple metric $\mu(p) \log Out(p)$ also performed well for the
task of starting point selection, whilst Kleinberg's metric
performed badly.
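
The combinations compared above amount to re-ranking pages by a product of their tf.idf score and a structural measure. The sketch below is a hedged illustration: the function names and the example scores are hypothetical, the structural measure (potential gain or outdegree) is supplied as plain input rather than computed, and pages whose measure is not positive are assumed to rank last under the logarithmic variant.

```python
import math
from typing import Callable, Dict, List


def product(mu: float, measure: float) -> float:
    """mu(p) * Pg(p), or mu(p) * Out(p) when given an outdegree."""
    return mu * measure


def log_weighted(mu: float, measure: float) -> float:
    """mu(p) * log(measure); non-positive measures rank last (assumption)."""
    return mu * math.log(measure) if measure > 0 else float("-inf")


def select_starting_points(
    mu: Dict[str, float],
    measure: Dict[str, float],
    k: int,
    combine: Callable[[float, float], float] = product,
) -> List[str]:
    """Return the k highest scoring pages, breaking ties by page name."""
    scored = {p: combine(mu[p], measure.get(p, 0.0)) for p in mu}
    return sorted(scored, key=lambda p: (-scored[p], p))[:k]


if __name__ == "__main__":
    tfidf = {"home": 0.2, "faq": 0.5, "leaf": 0.6}   # made-up scores
    outdegree = {"home": 20, "faq": 4, "leaf": 1}    # made-up link counts
    print(select_starting_points(tfidf, outdegree, 2, product))
    print(select_starting_points(tfidf, outdegree, 2, log_weighted))
```

In this toy input the page with the highest tf.idf score has a single out-link and drops out of the top two under either combination, which is the kind of shift away from relevant dead ends that the structural measure is meant to produce.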

6. RELATED WORK 

Many graph traversal and path-finding algorithms have 
been developed over the last 50 years and it is not unrea- 
sonable to question the development of a new one. We will 
consider the effects of a few of them. A depth-first traversal,
for example, is unsuitable for trail finding as it may tend
towards "black-holes" from which there is no escape. It is
considered unsuitable for crawling for similar reasons.
Breadth-first search is non-viable for anything other than very
short trails, due to the exponential growth of the tree. A
best-first search is possible but will struggle in situations
where the best pages are separated by content which is less
relevant - exactly the situations where automated navigation is
most needed! Another approach that has been used effectively
for computing solutions to the Travelling Salesman Problem
(TSP) is Ant Colony Optimization (ACO) [II]. Each
"ant" is an agent which uses a greedy heuristic to follow a 
trail based upon the weight of links and the presence of a 
"pheromone". This pheromone is laid by ants following a 
path, based upon the length of the final result. Our own ex- 
periments have provided anecdotal evidence that the Best 
Trail algorithm out-performs ACO for web-site trail finding, 
although the ACO system appears to out-perform the Best 
Trail in finding solutions to TSP. 

Several systems have allowed the manual construction of 
trails. Sillitoe et al. |Mri) proposed a system for manipulating 
trails, complete with forks and subtrails. They discussed a 
database-backed scheme for storing and retrieving the
information. Furuta et al. [15] developed a system for authoring,
modifying and re-using Walden's paths - guided tours which
could be used in a teaching environment. WebWatcher advises
users on navigation possibilities by highlighting links
as they browse. This forms a trail over time, but the
link-at-a-time approach does not allow the user to see the context
initially. We agree with Joachims et al.'s [18] belief that
"in many cases only a sequence of pages and the knowl- 
edge about how they relate to each other can satisfy the 
user's information need", but extend this to compute and 
show complete sequences in advance. Bernstein's approach 
to constructing trails was to ask the user to "choose an inter- 
esting starting point and ask the apprentice to construct a 
path through related material" . The tours were constructed 
via a best-first page finding scheme using document similar- 
ity measures |I]. 

The concept of Information Units, presented in [26], also
attempts to break away from the single page model, returning
small clusters of linked pages answering the user's query.
The returned units may be more compact than the trails
returned by the Best Trail, and cover situations which cannot
be handled using trails, but the returned results are not
navigable, nor has sufficient consideration been given to the
display of the results and subsequent user interaction. The
Cha-Cha system [10] presents results in a similar manner to
the NavSearch interface and shows results in context, but
the scoring is only conducted at the page level; the trails
leading to the page are chosen as the shortest paths, not
those with informative content.

Several metrics have been proposed for selecting nodes in
search results which relate to the issue of starting point
selection. The most famous, PageRank [30], only considers the
effect of incoming links. Kleinberg's Hubs and Authorities
metrics, and extensions of them, only consider the effect of
single links in each direction, whereas potential gain considers
the effect of more distant pages [20] [22].

Figure 10: Increasing either (a) the number of exploration iterations or (b) the number of convergence
iterations increases the score of the returned trails. When increasing $I_{explore}$, the algorithm slowly tends to
a limit, whilst exploring the solution space. When increasing $I_{converge}$ (and leaving $I_{explore}$ constant, e.g. as
in this example), the algorithm quickly tends to a limit.

Figure 11: Increasing $I_{explore}$, whilst decreasing $I_{converge}$, increases the resultant trail score when calculated
using sum distinct but decreases the resultant trail score when calculated using the weighted sum. The graphs
show values for $0 < I_{explore} < 100$ where $I_{converge} = 100 - I_{explore}$. Panels: (a) Sum Distinct, (b) Weighted Sum.



7. CONCLUSIONS AND FUTURE WORK 

We have presented an algorithm for finding trails across 
the graph of linked pages in a web site. Inspired by Bush's 
memex, these trails provide a structure to the returned re- 
sults and provide users with contextual information not pro- 
vided by traditional search facilities. 

Although site-search is of vital importance, and deserves 
special attention as an area of research separate from global 
search engines, it would be highly beneficial to allow full 
web-scale trail finding. Unfortunately, the current architec- 
ture will not scale to full-size web data. However, we can 
break the problem down. Conventional search engines do 
not index the full content of the web. They select some 
subset to index based on usage statistics, link analysis or 
the output of dedicated crawling algorithms designed to
select high-quality nodes first [31] [11]. We can select a subset
of this on which to perform trail computation. For example,
we could compute trail information on high-profile or
highly-popular sites and return single-page results for the
remaining indexed pages. An alternative strategy is to
construct a restricted graph based upon the search results for a
given query, over which trails could be constructed. Whilst
this approach would suffer fewer scalability problems, it might
suffer similar performance issues to Kleinberg's approach of
expanding the search results [20].

The work presented here has many applications in other,
non-hypertext areas. We have built a system called DbSurfer,
which applies these ideas to solve the join discovery
problem in relational databases by finding trails through the
graph of foreign key dependencies. We have also built systems
for finding trails in program documentation [41] and
source code. In this last example, the results are achieved
by combining trails discovered on several graphs, where each
graph corresponds to interactions in one of five different
coupling types (Inheritance, Interface, Aggregation, Parameter
Type and Return Type) [40]. In these examples, the
problems identified earlier are largely eliminated and the 
true potential of the trail-based navigation engine can be 
clearly seen. The navigation problem is widespread and occurs
in all types of software system. Alan Cooper describes
the phenomenon as "uninformed consent", when "at each
step the user is required to make a choice, the scope and
consequences of which are not known" [13]. Providing keyword
search and trail discovery over the graph of options
available at the application or operating system level could
greatly enhance user experience. For example, in Microsoft
Windows, the query "active desktop" might return a path
Start -> Settings -> Control Panel -> Folder Options.
Finally, we believe that the algorithm may have applications
in the fields of game playing and optimization problems.